Adil Yıldız
IT Expert - Data Science Enthusiast - Technophile
I am a continual learner and a reliable, dedicated, hardworking IT expert with 15+ years of experience. I have worked for leading companies in a variety of roles. My main skills are DevOps, highly available service management, technical project management, data science, and product management. I am able to handle multiple tasks daily, working well under pressure.
This page is dedicated to my small steps in data science. I have a degree in this field and am trying to extend my knowledge with real-life applications. Here I will mainly share things that are useful, practical, and applicable. Please contact me with any questions or feedback, positive or negative.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.
The goal of this study is to predict the manner in which the participants did the exercise. This is the “classe” variable in the training set; any of the other variables may be used as predictors. I will explain step by step how I built my model, how I used cross-validation, what I think the expected out-of-sample error is, and why I made the choices I did. I will also use the selected prediction model to predict 20 different test cases at the end of the study.
After preprocessing the data we will have 3 data sets on hand: training, testing, and pml_testing. I will use training to train the models. The testing set will be used for model selection (validation). The pml_testing set will be used for making predictions with the selected model.
# required packages: caret (partitioning, preprocessing, training), plotly
# (exploratory plots), earth (variable importance), C50 (C5.0 trees), and
# ipred (assumed here as the source of bagging())
library(caret)
library(plotly)
library(earth)
library(C50)
library(ipred)
setwd("C:/Users/Acer-nb/Downloads/ML_Project")
# read csv data, treating blanks and "NaN" values as NA
pml_training <- read.csv("pml-training.csv", na.strings = c("NA", "NaN", " ", ""))
pml_testing <- read.csv("pml-testing.csv", na.strings = c("NA", "NaN", " ", ""))
# keep only the columns that contain no NA values
pml_train <- pml_training[, colSums(is.na(pml_training)) == 0]
pml_test <- pml_testing[, colSums(is.na(pml_testing)) == 0]
# drop identifier, timestamp, and window columns, which are not predictive
dropcl <- grep("name|timestamp|window|X", colnames(pml_train), value = FALSE)
pml_training <- pml_train[, -dropcl]
dropcl <- grep("name|timestamp|window|X", colnames(pml_test), value = FALSE)
pml_testing <- pml_test[, -dropcl]
# split the cleaned data into training (80%) and testing/validation (20%) sets
set.seed(1234)
inTrain <- createDataPartition(y=pml_training$classe,p=0.8, list=FALSE)
training <- pml_training[inTrain,]
testing <- pml_training[-inTrain,]
After removing the unneeded columns, 52 predictors remain in the data. This is quite a lot, so we may not need all of them in our model.
# Make a matrix of correlations of all predictors
M <- abs(cor(training[,-53]))
# Zero out the diagonal (each predictor's correlation with itself is 1 by definition)
diag(M) <- 0
# Find the parameters having correlation over a threshold.
which(M > 0.8,arr.ind=T)
## row col
## yaw_belt 3 1
## total_accel_belt 4 1
## accel_belt_y 9 1
## accel_belt_z 10 1
## accel_belt_x 8 2
## magnet_belt_x 11 2
## roll_belt 1 3
## roll_belt 1 4
## accel_belt_y 9 4
## accel_belt_z 10 4
## pitch_belt 2 8
## magnet_belt_x 11 8
## roll_belt 1 9
## total_accel_belt 4 9
## accel_belt_z 10 9
## roll_belt 1 10
## total_accel_belt 4 10
## accel_belt_y 9 10
## pitch_belt 2 11
## accel_belt_x 8 11
## gyros_arm_y 19 18
## gyros_arm_x 18 19
## magnet_arm_x 24 21
## accel_arm_x 21 24
## magnet_arm_z 26 25
## magnet_arm_y 25 26
## accel_dumbbell_x 34 28
## accel_dumbbell_z 36 29
## gyros_dumbbell_z 33 31
## gyros_forearm_z 46 31
## gyros_dumbbell_x 31 33
## gyros_forearm_z 46 33
## pitch_dumbbell 28 34
## yaw_dumbbell 29 36
## gyros_forearm_z 46 45
## gyros_dumbbell_x 31 46
## gyros_dumbbell_z 33 46
## gyros_forearm_y 45 46
As seen in the results, several variables are highly correlated, which means not all of them should be kept in the model. Feature selection should be performed, and the models should be built on the smaller set of selected features.
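As an aside, caret also ships a ready-made helper, findCorrelation(), for exactly this kind of pruning. Here is a minimal base-R sketch of the idea on the built-in iris data; the 0.8 cutoff and the keep-first-of-each-pair rule are arbitrary choices made for the example:

```r
# Correlation-based pruning sketch on the built-in iris data (illustrative)
M <- abs(cor(iris[, 1:4]))
diag(M) <- 0                                  # ignore self-correlations
high <- which(M > 0.8, arr.ind = TRUE)        # index pairs above the cutoff
# each pair appears twice (i,j) and (j,i); drop the later column of each pair
drop <- unique(high[high[, "row"] > high[, "col"], "row"])
iris_reduced <- iris[, -drop]
ncol(iris_reduced)  # 3 columns remain: Sepal.Length, Sepal.Width, Species
```

In iris the petal measurements correlate strongly with each other and with sepal length, so both petal columns get dropped.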
The following exploratory graphs show that some features separate the classes quite well, so we may expect high accuracy when predicting the outcome.
# box plots of the first and third predictors (roll_belt and yaw_belt) by class
plot_ly(training, color = training$classe, y = training[, 1], type = "box")
plot_ly(training, color = training$classe, y = training[, 3], type = "box")
The following models were fit on 25 features extracted from the data by PCA. Even after this kind of feature extraction, the random forest model takes too long to run.
# Create as many principal components as needed to explain 95% of the variance
preProc <- preProcess((training[,-53]+1),method=c("center","scale","pca"),thresh = 0.95)
trainPC <- predict(preProc,(training[,-53]+1))
testPC <- predict(preProc,(testing[,-53]+1))
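For intuition, the same 95%-of-variance threshold can be cross-checked with base R's prcomp. A small sketch on the built-in iris data (the variable names here are illustrative, not from the project):

```r
# How many principal components explain 95% of the variance of a
# centered and scaled data set? (sketch on iris)
pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
cumvar <- cumsum(pc$sdev^2) / sum(pc$sdev^2)  # cumulative variance explained
n_comp <- which(cumvar >= 0.95)[1]
n_comp  # 2 components cover about 95.8% of the variance in iris
```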
mod1 <- train(x=trainPC, y=training$classe, method="lda")
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
##
## select
mod2 <- train(x=trainPC, y=training$classe, method="knn")
pred1 <- predict(mod1,testPC)
pred2 <- predict(mod2,testPC)
Even after simplifying with PCA, the models take too long to run, so I tried another method to make the models simpler.
The earth package implements variable importance based on generalized cross-validation (GCV), the number of model subsets in which the variable occurs (nsubsets), and the residual sum of squares (RSS). I tried this method on the data as follows:
marsModel <- earth(classe~., data=training)
ev <- evimp (marsModel,trim = FALSE)
ev
## nsubsets gcv rss
## roll_belt 21 100.0 100.0
## magnet_dumbbell_y 20 90.6 90.6
## roll_forearm 19 85.6 85.6
## accel_belt_z 17 72.8 72.9
## magnet_dumbbell_z 16 70.0 70.1
## yaw_belt 15 66.0 66.1
## roll_dumbbell 13 56.2 56.4
## total_accel_dumbbell 12 51.9 52.1
## pitch_belt 10 43.5 43.7
## pitch_forearm 7 31.9 32.1
## total_accel_belt-unused 0 0.0 0.0
## gyros_belt_x-unused 0 0.0 0.0
## gyros_belt_y-unused 0 0.0 0.0
## gyros_belt_z-unused 0 0.0 0.0
## accel_belt_x-unused 0 0.0 0.0
## accel_belt_y-unused 0 0.0 0.0
## magnet_belt_x-unused 0 0.0 0.0
## magnet_belt_y-unused 0 0.0 0.0
## magnet_belt_z-unused 0 0.0 0.0
## roll_arm-unused 0 0.0 0.0
## pitch_arm-unused 0 0.0 0.0
## yaw_arm-unused 0 0.0 0.0
## total_accel_arm-unused 0 0.0 0.0
## gyros_arm_x-unused 0 0.0 0.0
## gyros_arm_y-unused 0 0.0 0.0
## gyros_arm_z-unused 0 0.0 0.0
## accel_arm_x-unused 0 0.0 0.0
## accel_arm_y-unused 0 0.0 0.0
## accel_arm_z-unused 0 0.0 0.0
## magnet_arm_x-unused 0 0.0 0.0
## magnet_arm_y-unused 0 0.0 0.0
## magnet_arm_z-unused 0 0.0 0.0
## pitch_dumbbell-unused 0 0.0 0.0
## yaw_dumbbell-unused 0 0.0 0.0
## gyros_dumbbell_x-unused 0 0.0 0.0
## gyros_dumbbell_y-unused 0 0.0 0.0
## gyros_dumbbell_z-unused 0 0.0 0.0
## accel_dumbbell_x-unused 0 0.0 0.0
## accel_dumbbell_y-unused 0 0.0 0.0
## accel_dumbbell_z-unused 0 0.0 0.0
## magnet_dumbbell_x-unused 0 0.0 0.0
## yaw_forearm-unused 0 0.0 0.0
## total_accel_forearm-unused 0 0.0 0.0
## gyros_forearm_x-unused 0 0.0 0.0
## gyros_forearm_y-unused 0 0.0 0.0
## gyros_forearm_z-unused 0 0.0 0.0
## accel_forearm_x-unused 0 0.0 0.0
## accel_forearm_y-unused 0 0.0 0.0
## accel_forearm_z-unused 0 0.0 0.0
## magnet_forearm_x-unused 0 0.0 0.0
## magnet_forearm_y-unused 0 0.0 0.0
## magnet_forearm_z-unused 0 0.0 0.0
According to the model, only 10 of the variables have an effect on the outcome, so I created subsets of the data containing just these 10 variables.
training_imp <- subset(training,select = c(classe,roll_belt,magnet_dumbbell_y,roll_forearm,accel_belt_z,magnet_dumbbell_z,yaw_belt,roll_dumbbell,total_accel_dumbbell,pitch_belt,pitch_forearm))
testing_imp <- subset(testing,select = c(classe,roll_belt,magnet_dumbbell_y,roll_forearm,accel_belt_z,magnet_dumbbell_z,yaw_belt,roll_dumbbell,total_accel_dumbbell,pitch_belt,pitch_forearm))
pml_testing_imp <- subset(pml_testing,select = c(roll_belt,magnet_dumbbell_y,roll_forearm,accel_belt_z,magnet_dumbbell_z,yaw_belt,roll_dumbbell,total_accel_dumbbell,pitch_belt,pitch_forearm))
I built 4 models with this subset.
mod11 <- train(classe ~ ., data = training_imp, method = "rf", preProcess = c("center", "scale"))
# note: preProcess is a caret::train argument; bagging() and C5.0() below are
# called directly (not through caret), so it would likely be ignored there
mod12 <- bagging(classe ~ ., data = training_imp)
mod13 <- C5.0(classe ~ ., data = training_imp)
mod14 <- train(classe ~ ., data = training_imp, method = "knn", preProcess = c("center", "scale"))
pred11 <- predict(mod11,testing)
pred12 <- predict(mod12,testing)
pred13 <- predict(mod13,testing)
pred14 <- predict(mod14,testing)
confusionMatrix(predict(mod14,testing_imp),testing_imp$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1096 30 5 2 0
## B 9 671 18 7 10
## C 6 23 629 26 12
## D 3 31 26 605 12
## E 2 4 6 3 687
##
## Overall Statistics
##
## Accuracy : 0.9401
## 95% CI : (0.9322, 0.9473)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9242
## Mcnemar's Test P-Value : 2.215e-05
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9821 0.8841 0.9196 0.9409 0.9528
## Specificity 0.9868 0.9861 0.9793 0.9780 0.9953
## Pos Pred Value 0.9673 0.9385 0.9037 0.8936 0.9786
## Neg Pred Value 0.9928 0.9726 0.9830 0.9883 0.9894
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2794 0.1710 0.1603 0.1542 0.1751
## Detection Prevalence 0.2888 0.1823 0.1774 0.1726 0.1789
## Balanced Accuracy 0.9844 0.9351 0.9495 0.9595 0.9741
confusionMatrix(predict(mod13,testing_imp),testing_imp$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1090 11 8 1 1
## B 21 721 8 3 6
## C 4 15 651 11 6
## D 1 8 14 626 3
## E 0 4 3 2 705
##
## Overall Statistics
##
## Accuracy : 0.9669
## 95% CI : (0.9608, 0.9722)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9581
## Mcnemar's Test P-Value : 0.2972
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9767 0.9499 0.9518 0.9736 0.9778
## Specificity 0.9925 0.9880 0.9889 0.9921 0.9972
## Pos Pred Value 0.9811 0.9499 0.9476 0.9601 0.9874
## Neg Pred Value 0.9908 0.9880 0.9898 0.9948 0.9950
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2778 0.1838 0.1659 0.1596 0.1797
## Detection Prevalence 0.2832 0.1935 0.1751 0.1662 0.1820
## Balanced Accuracy 0.9846 0.9690 0.9703 0.9828 0.9875
confusionMatrix(predict(mod12,testing_imp),testing_imp$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1106 8 2 0 3
## B 3 732 4 2 5
## C 5 10 672 6 3
## D 1 7 6 635 3
## E 1 2 0 0 707
##
## Overall Statistics
##
## Accuracy : 0.9819
## 95% CI : (0.9772, 0.9858)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9771
## Mcnemar's Test P-Value : 0.05179
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9910 0.9644 0.9825 0.9876 0.9806
## Specificity 0.9954 0.9956 0.9926 0.9948 0.9991
## Pos Pred Value 0.9884 0.9812 0.9655 0.9739 0.9958
## Neg Pred Value 0.9964 0.9915 0.9963 0.9976 0.9956
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2819 0.1866 0.1713 0.1619 0.1802
## Detection Prevalence 0.2852 0.1902 0.1774 0.1662 0.1810
## Balanced Accuracy 0.9932 0.9800 0.9875 0.9912 0.9898
confusionMatrix(predict(mod11,testing_imp),testing_imp$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1112 3 1 0 0
## B 2 743 5 0 2
## C 2 7 677 4 0
## D 0 6 1 639 1
## E 0 0 0 0 718
##
## Overall Statistics
##
## Accuracy : 0.9913
## 95% CI : (0.9879, 0.994)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.989
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.9789 0.9898 0.9938 0.9958
## Specificity 0.9986 0.9972 0.9960 0.9976 1.0000
## Pos Pred Value 0.9964 0.9880 0.9812 0.9876 1.0000
## Neg Pred Value 0.9986 0.9950 0.9978 0.9988 0.9991
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2835 0.1894 0.1726 0.1629 0.1830
## Detection Prevalence 0.2845 0.1917 0.1759 0.1649 0.1830
## Balanced Accuracy 0.9975 0.9880 0.9929 0.9957 0.9979
All models performed quite satisfactorily, but the winner was mod11, the random forest model, with 0.991 accuracy.
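The selection above boils down to picking the model with the highest held-out accuracy. A minimal self-contained sketch of that rule, with toy predictions made up purely for illustration:

```r
# Toy held-out truth and two candidate models' predictions (illustrative only)
truth <- factor(c("A", "B", "A", "B", "A", "B"))
preds <- list(
  m1 = factor(c("A", "B", "A", "B", "B", "B")),  # 5/6 correct
  m2 = factor(c("A", "A", "A", "B", "B", "B"))   # 4/6 correct
)
accs <- sapply(preds, function(p) mean(p == truth))  # accuracy per model
best <- names(which.max(accs))  # pick the most accurate model: "m1"
```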
Here are the results with the small test data of 20 observations:
predict(mod11, pml_testing)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
All 20 test activities are predicted correctly, so the model works well.
The classes in the testing data are predicted with 99% accuracy, which is almost a perfect score, and the other performance metrics are also very high. This means the expected out-of-sample error rate is very small. A question arises here: is there an over-fitting issue? There should not be. The number of observations is quite sufficient, and I do not expect to see very different variation in real life.
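The out-of-sample error estimate quoted here is simply one minus the held-out accuracy, i.e. the off-diagonal share of the confusion matrix. A tiny base-R illustration with made-up labels:

```r
pred <- factor(c("A", "A", "B", "B", "B"))   # toy predictions (illustrative)
truth <- factor(c("A", "B", "B", "B", "A"))  # toy true labels
cm <- table(pred, truth)                     # confusion matrix
accuracy <- sum(diag(cm)) / sum(cm)          # correct / total = 3/5
oos_error <- 1 - accuracy                    # estimated out-of-sample error = 0.4
```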